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Abstract 

Motivated by genetic association studies of pleiotropy, we propose here a 
Bayesian latent variable approach to jointly study multiple outcomes or pheno- 
types. The proposed method models both continuous and binary phenotypes, 
and it accounts for serial and familial correlations when longitudinal and pedi- 
gree data have been collected. We present a Bayesian estimation method for 
the model parameters, and we develop a novel MCMC algorithm that builds 
upon hierarchical centering and parameter expansion techniques to efficiently 
sample the posterior distribution. We discuss phenotype and model selection in 
the Bayesian setting, and we study the performance of two selection strategies 
based on Bayes factors and spike-and-slab priors. We evaluate the proposed 
method via extensive simulations and demonstrate its utility with an applica- 
tion to a genome-wide association study of various complication phenotypes 
related to type 1 diabetes. 

Keywords: Bayesian inference, Genetics, Latent variable, Markov chain Monte 
Carlo, Path Sampling, Pleiotropy 

1 Introduction 

Pleiotropy occurs when a single genetic factor influences multiple continuous or binary 
phenotypes, and it is present in many genetic studies of complex human traits such as 
diabetes, hypertension and cardiovascular diseases. In genetic studies of complications 
or secondary manifestations of a disease, it is often believed that there are common 
genetic risk factors for the different phenotypes. In other cases, the primary and 
often conceptual phenotype (e.g. disease severity) may not be directly measured or be 
characterized by one single phenotype, and a set of surrogate response variables must 
instead be used. These response variables (phenotypes or outcomes) are mutually 
correlated as they measure the underlying trait from different perspectives. In order 
to take into account all information and to increase statistical efficiency, it is desirable 
to model these outcomes jointly. 



2 



An added characteristic of many emerging large-scale genetic studies is the col- 
lection of repeated measures over time in correlated individuals (family data). For 
example, the ongoing T2D-GENES (Type 2 Diabetes Genetic Exploration by Next- 
generation sequencing in multi-Ethnic Samples) consortium study includes longitudi- 
nal measures of various T2D-revelant phenotypes (e.g. blood glucose levels and blood 
pressures) and other covariates (e.g. sex, body mass index and medication history), 
and these measures are collected for 2,500 individuals from 85 Mexican- American 
families. Similarly, the DCCT (Diabetes Control and Complications Trial) study in- 
cludes longitudinal measures of various type 1 diabetes (T1D) complications, which 
we analyze in this paper. 

The longitudinal family studies combine the features of longitudinal studies in 
independent individuals and studies using single-time-point phenotype measures in 
families, providing more information about the genetic and environmental factors as- 



sociated with the traits of interest than cross-sectional studies (Burton et al. 2005). 
However, joint modeling of multiple phenotypes using longitudinal family data in- 
volves non-trivial statistical and computational challenges because of the complex 
correlations that exist between different phenotypes (the phenotypical correlation), 
between repeated measures from the same phenotype (the serial correlation) and 
between individuals within the same family (the familial correlation). 

Robust and powerful methods for the study of pleiotropy are under-developed in 
the literature due to data complexities that include a) the phenotypes of interest can 
be continuous, discrete or both, b) the joint effect of covariates on the multiple pheno- 
types is difficult to specify, and c) the familial, serial and other correlations are often 
present in the data as discussed above. There are a number of approaches proposed 



for cross-sectional data. For instance, Xu et al. (2003) extended the standard linear 



combination test to incorporate data-driven weighting factors, Weller et al. (1996) 
applied the principal component analysis to the multiple traits of interest to obtain 
independent canonical variables and then conducted univariate quantitative trait lo- 



cus (QTL) analyses, and Lange and Whittaker (2001) developed a QTL-mapping 
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method based on generalized estimating equations. 

Here we propose to use the latent variable (LV) methodology to jointly study 
multiple correlated phenotypes in the presence of serial and familial correlations. 
The formulation of a latent variable model (LVM) relies on postulating the effect of 
a random variable that is not observed by the researchers but is assumed to exert 
an important influence on the set of observed variables (also known as the manifest 



variables) and thus induces correlations among them (Bartholomew et al. 2011). In 
the context of pleiotropy studies, the manifest variables are the multiple observed 
phenotypes, which jointly inform the latent variable that represents the underlying 
conceptual disease status or severity. 

The LV methodology has been widely used in many scientific fields including 
economics, psychology and social sciences, and it is becoming increasingly attractive 



for genetic studies. For example, O'Hara et al. (2010) proposed a LV approach for 



the analysis of multivariate quantitative trait loci, Tayo et al. (2008) applied a factor 
analysis (a sub-type of LVM) to find latent common genetic components of obesity 



traits, and Nock et al. (2009) used the factor analysis for a metabolic syndrome study. 
Initial applications of LVM focused on reducing the number of manifest variables to a 



smaller number of latent outcomes. Sammel and Ryan (1996) and Sammel and Ryan 



(1997) extended the LVM to allow covariates to have effects on both the manifest and 



latent variables. Roy and Lin (2000) discussed a LV approach for longitudinal data 
with continuous outcomes. 

The proposed LVM consists of two parts. The first part models the relationship 
between the manifest variables and the LV to characterize the within-subject cor- 
relation among the different outcomes. The second part uses a linear mixed effect 
model to investigate the effect of a genetic marker and other covariates on the LV, 
accounting for the serial and familial correlations. Direct effects of covariates on the 
manifest variables are also allowed in the first part of model. 

The paper makes a number of original contributions. The proposed LV method 



generalizes the work of Roy and Lin (2000) to longitudinal family data with binary 
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and continuous responses. The Bayesian formulation can be desirable in practice 
as it offers a principled way to incorporate prior information, often available in ge- 
netic studies, and to perform finite sample likelihood-based inference. However, the 
Bayesian model raises important computational challenges because the sampling al- 
gorithm required to study the posterior is inefficient when it is applied in its standard 
form. We introduce a novel algorithm that relies on the hierarchical centering and 



parameter expansion techniques (Gelfand 1995; Liu and Wu 1999; Meng and van 



Dyk 1999) to improve computational efficiency. 



The rest of the paper is organized as follows. Section 2 introduces the details 
of the LVM and discusses the consequences of naively ignoring the family structure 
in the data. Section 3 presents a Bayesian estimation for the model parameters and 
a novel MCMC algorithm designed to sample the posterior distribution efficiently. 
Section 4 discusses phenotype selection and model selection in the Bayesian setting. 
Section 5 shows results from extensive simulation studies, and Section 6 applies the 
proposed method to a genetic study of T1D complications. Section 7 concludes with 
recommendations and further discussions. 



2 Latent Variable Model for Longitudinal Family 
data. 

2.1 The Statistical Model 

Let Y cit = {ydtii ■ ■ ■ iVcitj)' be the J x 1 vector of outcomes/phenotypes/manifest 
variables measured at the t th time on the i th individual from the c th family /cluster for 
c = 1, 2, . . . , C, i — 1, 2, . . . , N c , t — 1, 2, . . . , T, where C denotes the total number of 
families, N c is the number of individuals within the c th family, and T is the total num- 
ber of repeated measurements. Among the J outcomes, Y c cit = (y^. itl , . . . , UcitjJ' are 
continuous and Y b cit = {y h cit j 1+ i-, • • • , Ucuj)' are binary. Let U cit be the LV representing 
the conceptual disease severity which aggregates the partial information brought by 
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each of the J phenotypes. 

In the first part of the LVM, a continuous phenotype y c is linked to the latent 
trait U via a linear mixed model 



V'atj = Ay + + \jUat + bdj + e cit3 , (2.1) 



where e cit j ~ iV(0, erj), is a px-dimensional vector of covariates that have direct 
effects on the phenotype (direct fixed-effect covariates) and Xj is the factor loading 
that represents the effect of the LV on the j phenotype. When all XjS are equal to 1, 



model (2.1) is reduced to a mixed effect model. The random component b C ij captures 
the family-specific within- subject correlations over time. We assume b ci j ~ iV(0,T 2 ), 
and e c uj and b ci j are mutually independent for c = 1, . . . , C, i — 1, . . . , N c , t — 1, . . . , T 
and j = 1, . . . , J. 

If a phenotype is binary, a generalized linear mixed model is assumed, 

Vatj = Poj + Wj it f3j + X 3 U clt + b^, (2.2) 

with a probit link, 

E [yitjWcit, b cij ] = p(y b citJ = l\U cit , b cij ) = <5>(ricitj). (2.3) 

We choose a probit link instead of a logit link to gain computational efficiency Specif- 
ically, we take advantage of the well-known representation of the probit model based 
on a normally distributed LV. If y b cit j ~ N(r] cit j, 1) is the Gaussian variable underlying 



the binary response y cit j, then (|2.3|) is recovered when y citj = 1^ 



Jdtj ~ x {y b citj >0} 



The second part of the LVM assess the effect of X cit , a p 2 ~ dimensional vector 
of variables that of primary interest (e.g. a genetic marker and possibly additional 
clinical factors), on the latent variable U via a linear mixed model, 

U cit = Xj it a + Zj it a c + Q T cit d ci + e cit , (2.4) 
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where e c u ~ N(0,tp 2 ). Elements in X are also called indirect fixed-effect covariates 
because their effects on a phenotype of interest Y is induced via the effect of the latent 
variable U on Y modelled in first part of the LVM above. Pleiotropy is detected if 
the element of a corresponding to the genetic marker is found to be significant. We 
assume a c G R 9lXl are the family-specific random effects and Z c u is the corresponding 
vector of covariates. Similarly, d C i e R 92Xl represents the subject-specific random 
effects with Q cit its associated covariate vector. We assume that a c ~ MVN q \(Q, Ha), 
d C i ~ MViVg2(0, E_d) and all random effects are independent of the e C j t . 



2.2 Model Identification Restrictions 



The following modification of equation (2.1) 



Vcitj = Ay + WStPj + XjK^KUdt + b clJ + e citj , (2.5) 

where K is an arbitrary nonzero number, suggests that, without any restriction on A or 
the variance of U c u, an infinite number of equivalent models can be created. A similar 
phenomenon appears in the binary phenotype case. In order to avoid unidentifiability, 
we assume that the variance of U c u is equal to 1 and that Xj is non-negative. Because 
we allow covariate effects in both parts of the LVM, we assume that the two sets of 



covariates are disjoint and equation (2.4) does not contain the intercept 



2.3 Effects of Ignoring Familial Correlation 



Individuals from the same family are genetically related resulting in correlation be- 
tween their latent disease status. In practice, to reduce the analytic complexity and 
computation burden, one may choose to assume sample independence and apply ex- 



isting methods (e.g. Roy and Lin (2000)). However, ignoring the family structure will 
cause incorrect inference for the model parameters. To see this clearly, we assume a 
simplified case where the phenotypes of interest are all continuous and there are no 
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repeated measures. The two parts of LVM are reduced to 



Vdj = Ay + W^fy + XjUd + e cijl and U ci = X^a + Z^a c + e c 



where c = 1, . . . ,C, i = 1, . . . , N c and j = 1, . . . , J, with independent error terms 
e cij ~ N(0, of) and e ci ~ N(0, 1), \j > and a c ~ iV(0, S A ). 

The variance of the j th response for individual % from family c can be decomposed 
in terms of the model parameters as: 



Var(y, 



cijj 



a +A + \i 



3 ci 



(2.6) 



Suppose we ignore the familial correlation in the data and propose the following model 

Vhj = Ay + W% (3j + XjU h + e hj , and U h = X%a + e h , 

where h = 1, . . . , N is the individual's index and N is the total sample size. In this 
case, the variance of the j th response for individual h is decomposed as 



Var(y hj ) = a] + X]. 



(2.7) 



By equating (2.6) and (2.7), we find that A| = (Z^E^Zd + 1)A| and thus it is easy 
to see that 



A, 



a 



a 



a < a. 



A, V(Z^ A Z ct + 1) 
Therefore, ignoring familial correlation can lead to significant underestimation of a, 
the effect of a genetic marker on the LV of interest. However, the omission of existing 
familial correlation leads to overestimation of A, the effect of the LV on the phenotypes 
of interest in the first part of the LVM, as demonstrated in the simulation studies of 
Section 5.2 below. This is consistent with what's reported in the statistical genetics 



literature (e.g. Thornton and McPeek (2010)) 



In the longitudinal setting, as seen from equation (2.4), ignoring the familial cor- 



S 



relation will also result in biased estimation of A and a, as well as the serial correlation 
Simulation results reported in Section [5] support these conclusions. However, par- 
ticular to the longitudinal setting, if covariates Z are not present, the error caused 
by ignoring the familial correlation can be absorbed into the serial correlation and 
thus only will be incorrectly estimated. Nevertheless, there is still some loss of 
efficiency in the estimation of A and a in this case. 



3 Parameter Estimation via Bayesian Method 



Traditional solutions for LVM, including the popular software such as LISREL ( Joreskog 



and Sorbom, 1996) and MPLUS (Muthen and Asparouhov, 2011), rely on frequen 



tist methods. The development of modern computational algorithms, in particular 
Markov chain Monte Carlo (MCMC), enables us to use LVM for dependent data 
within the Bayesian paradigm. The Bayesian approach also offers a principled ap- 
proach to produce finite sample likelihood-based inference and to incorporate any 
available prior information which, in a genetic analysis setup, can be considerable. 



3.1 Bayesian Model Setup 

The data in our model contain the observed continuous or binary outcomes Y, 
the direct fixed-effect covariates W, the indirect fixed-effect covariates X, and the 
random-effect covariates Z and Q. The vector of parameters of interest is = 
(A), A a, A, r 2 , a 2 , E A , E D )' where f3 = (An, • • • , M' , (3 = • • • , 0j)' where = 
(fa, . . . ,P jPl )', a = (a u ...,a P2 )', A = (A a , . . . , Aj)', r 2 = (r 2 , . . . , rj)' and a 2 = 
(cr 2 , . . . , crj )'. Therefore, the posterior distribution for the model parameters is 

p(6|Y, W, X, Z, Q) oc P (e)p(Y\e, W, X, Z, Q). 



The complexity of the sampling model requires the use of MCMC algorithms for 
statistical inference. Unfortunately, the commonly used priors in probit and linear 
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mixed effects models along with a run-of-the-mill sampling scheme lead to a torpidly 
mixing chain. In the next section, we discuss the prior specifications and the algo- 
rithmic modifications that we have implemented to improve the MCMC efficiency. 



3.2 MCMC sampling 



We follow the data augmentation (DA) principle of Tanner and Wong (1987) and 
sample alternatively from the posterior distribution given the complete data and 
from the LV's distribution given the observed data and the parameter values. We first 
discuss the sampling scheme for the relatively simple case when all the phenotypes 
are continuous and we then extend the algorithm to binary phenotypes. 



3.2.1 Continuous Traits 



When all the phenotypes are continuous and the conditional conjugate priors are 
defined for the model parameters, a standard Gibbs (SG) sampler can be used for 
MCMC sampling from the posterior distribution. However, due to high dependence 
between the components of the Markov chain corresponding to the parameter vector 
© and the missing and latent data vector Q, we observe a very slow mixing of the 



chain. Improvement can be obtained by using hierarchical centering (HC) (Gelfand 



1995). The HC technique moves the parameters up the hierarchy via model reformu- 
lation. Specifically, in (2.1) we move /3o/s up the model hierarchy to be the mean of 
the random effect b so that the new random effect is &!■„• = floj + b C ij. 



CI] 



Another technique devised to overcome the slow convergence problem of DA 



algorithms is parameter expansion (PX) (Meng and van Dyk, 1999; Liu and Wu 



1999). The idea behind PX is to introduce auxiliary parameters in the model and 



average over all their possible values in order to produce inference for the original 



model of interest. As demonstrated by Meng and van Dyk (1999) and Liu and Wu 



(1999), the benefits of this apparently circuitous strategy can be highly significant, 



because the larger parameter space allows the Markov chain to move more freely and 
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breaks the dependence between its components. 

In our implementations, we have observed a strong coupling between the updates 
of the random effect b C ij and its variance r 2 , and between the factor loading Xj and 
the latent factor U for all 1 < j < J. For instance, an update of r 2 close to zero likely 
yields a small update of b C ij and vice versa. Thus, we introduce auxiliary parameters 
£j and ip and define the following PX with hierarchical centering model (PX-HC) 

Vcitj = WSJ, + X*U* clt + i 3 b* cij + e citj , with (3.1) 

Kix = + Z T at al + Q T at d* cl + e^, (3.2) 

where b* cij ~ N^rg), a* c ^ MVN ql (0, E*), d* ci MVN q2 (0,E* d ), and e* cU S3 
N(0,ip 2 ). To relate the original parameters to the expanded parameters, the fol- 
lowing transformations are used 

a = a *M u clt = t/* t M s A = sy^ 2 , S D = s^/^ 2 , 

A, = A^, P j0 = P* j0 &, rj (;'/;• for all 1 < j < J. 

The conjugate priors used for the auxiliary parameters involved in the PX scheme 
lead to a Gibbs sampler in which each conditional distribution is available in closed 
form and induce the folded-t (the absolute value of t distribution) prior distributions 
for the parameters r and A. More precisely, since Var(6* ij ) = rjj in our PX-HC model 
and Var(6 C jj) = r 2 in the original model, we have Tj = \^j\T]j. When the conditional 
conjugate normal and inverse-Gamma prior are applied to £j and ?7 2 , respectively, the 
resulting prior for Tj is the folded-t distribution. Similarly, since Aj = X*ip, a half 
normal prior assigned to A* and inverse-Gamma prior to ip 2 wm result in a folded-t 
prior for Aj. Other authors have discussed the suitability of folded-t priors in mixed 



effects and factor analysis models. For instance, Gelman (2006) noted the added 



flexibility and improved behaviour when random effects are small, and Ghosh and 



Dunson (2009) suggested the use of t or folded-t priors for the factor loadings in a 
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factor analysis setting. 

We consider independent and conjugate priors for the expanded model param- 
eters 0* = (/5q, (3, a*, A*, r] 2 , a 2 , £*, EJj, ip, £)'. For example, we use normal priors for 
the fixed-effect coefficients. Thus, 

13* j ~ N{0, 1000), fa ~ A/" Pl (0, 1000 * I Pl ), 77? _ I nv -Gamma (f , f ), for 1 < j < J; 
a ~ Ap 2 (0,1000 *I P2 ). 

For the scale parameters, we specify conditional conjugate priors, 

a 2 ~ lnv-Gamma(0.1, 0.1) for j = 1, . . . , J; 

S* ~ Inv-Wishart(K, l ) with K = <?i + 1 and 5 a = 10 * I,i; 
~ Inv-Wishart(Vd, S d l ) with V d = q 2 + 1 and S d = 10 * I g2 . 
For the auxiliary parameters we have introduced, the priors are 

i) 2 ~ Inv-Gamma (f , f ) , £ ~ iV(0, 1), 

where v\ and t>2 are the hyperparameters representing the degrees of freedom (df) of 
the induced folded-t priors for Xj and r 2 , respectively. 

For the purpose of phenotype selection, which will be discussed in detail in Sec- 
tion 4.1, we specify a spike- an d-slab prior for A* 

p(A*) = (1 - tt^o} + 7r,TN + (0, 1), (3.3) 

with the hyperparameter 11 j ~ Beta(a, b). Notice that the induced prior for A in the 
original inference model is also a spike-and-slab prior with the spike at zero and the 
slab distribution equal to a folded-t distribution with v\ df. 

After defining the conditional conjugate priors for the expanded model, we now 
can apply the Gibbs sampling to obtain the posterior samples. Details of sampling 
are described in the Appendix, and the key steps at a given iteration k — 1 are: 
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Step 1: Draw 6*( fc ) from /(6*|fi* (fc 1 \y c ), which involves sampling ((3[, ...,(3j)' , (3q, 
a*, A*, r/ 2 , a 2 , E*, E*, ^, f , and 

Step 2: Draw from /(fi* 1 6* (fc) ,t/ c ), which involves sampling fi* = (U* ,b* , a* , d*)' . 
3.2.2 General Traits 

Suppose that the phenotypes of interest includes both continuous and binary traits. 
Without loss of generality, we assume that the first J\ phenotypes are continuous and 
the remaining ones are binary. In order to address concerns involving the MCMC 
mixing similar to those in the continuous response case, we define the model 

y c a tj = WStfij + X*U* cU + + e cltv l<j< J u (3.4) 

p(j& tj = 1) = HWj^j + \*U* cit + i^), Ji + l<j<J, (3.5) 

Kix = ^ u a* + Z T clt a* c + Q T clt d* cl + e* clt , (3.6) 

where b* clJ ~ N(f3* 0pV ]), a* c ^ W^PJ, d* cl ^ MVN q2 (0, E* d ), e* clt S5 iV(0,^ 2 ), 
and the prior distributions are the same as in the continuous case. 

A specific issue encountered here is that some of the conditional distributions 
required in the Gibbs sampler are not available in closed form. One possible solution 
is to employ the DA scheme proposed by Albert and Chib| (1993) that uses the 



underlying continuous variable y b cit ^ (discussed in Section 2.1) as an auxiliary variable, 
so that all the conditional posterior densities in the expanded model can be directly 



sampled from. However, even in the model defined by (3.4) — (3.6) we have noticed 
strong posterior dependence between y h cit - and some of the model parameters which 
triggered a sluggish mixing of the chain. We apply an additional layer of the PX 

~rb* 
citj 



scheme by introducing a working parameter and a one-to-one mapping y h * 



IjVcitj- We use the marginal DA scheme 3 of van Dyk and Meng (2001), leaving the 
prior distributions and the parameterization of the model unchanged. We call this 
algorithm the doubly parameter- expanded with hierarchical centering and denote it as 
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PX 2 -HC. Below we summarize the fcth iteration in the Gibbs sampling algorithm, 
and we provide a complete description in the Appendix. 

Step 1: Draw 

~*(k) f ™ffe;,l), if = 1 

where ^ = W^j*^ + \f' 1] uf " 1} + Cf^fe^. Transform y*y to ^ 
via^ij = ljy b city 

Step 2: For j — Ji + 1, J, the order of updating , A* , £j , 7^ ) involves 
sampling first 7^" and then (/3j , A* , £^ ) from their respective conditional 
densities. Set fif = ^/ 7 f , Aj W = , and £f = § fc) /% (fc) - 

The updates of the remaining parameters for the continuous responses and 
for the second part of the LVM are the same as those for the case with only 
continuous responses (see Section 3.2.1 and the Appendix). 

Step 3: Draw tt*W from /(fi*|6*W, y c , y**^). 

Figures 1 and 2 illustrates the significant improvement, in terms of reduced autocor- 
relation, brought by the additional layer of parameter expansion. See Section 5 for 
details of the simulations and the Appendix for additional comparison plots. 



4 Model and Variable Selection 



4.1 Selection of Relevant Phenotypes 

In medical research, it is of interest to determine if a phenotype under the study is 
truly relevant to the latent disease status. In the LVM setting, this is equivalent to 



testing if the coefficient or factor loading Aj for the j phenotype ((2.1 ) for continuous 



phenotype and (2.2) for binary phenotype) is statistically significant or not. 
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The sign restriction on the factor loadings, Xj-s > as discussed in Section 2.2, 
implies that the highest posterior density interval (Hpdl) will seldom include zero 
and is thus anti-conservative as a selection criterion. To assess the significance of the 
factor loadings, we apply the Bayesian model selection method using the spike-and- 



slab priors (Mitchell and Beauchamp, 1988; George and McCulloch, 1993 Chipman 



1996 



Xu et al. 2011). Specifically, we use the spike-and-slab prior (3.3) for each A* 



The point mass at zero shrinks small values of the factor loading towards zero, while 
the values of a and b reflect the prior belief in Xj = (see Xu et al. 2011, for a 
similar discussion on the use of spike-and-slab priors to alleviate the winner's curse 
in genetic association studies). When there is no prior information, we recommend 
a = 1 and 6=1 which correspond to ttj ~ Unif(0, 1). For those phenotypes that are 
a priori considered to be unrelated to the latent disease variable, we can set a = 0.25 
and 6 = 1, thus favouring a priori small values for the corresponding XjS. 

The determination of the relevance of the j th phenotype is based on the posterior 
probability of positive loading, Pr(Xj > 0|1^). The sampling algorithm discussed in 
the Appendix introduces the latent mixture indicator cjj = l^x)}, so that Pr(Xj > 
0\Y) can be approximated by the MCMC sample frequency of {u>j = 1}. Given a 
pre-specified threshold 0, we assume that any loading with Pr(Xj > 0\Y) > <p should 
be included in the model. The value for depends on the practical problem. If the 
number of manifest variables is large, the <fi value can be chosen to control the overall 



average Bayesian false discovery rate (FDR) (Morris et al. 2008). The performance 
of selecting the correct phenotypes is shown in the simulation Section 5.3.1 below. 



4.2 Model Selection Using Bayes Factor 



Bayes factors are central to the Bayesian model selection and comparison. The Bayes 
factor for comparing model M and M\ is defined as: BF = p/y ^ . Assuming equal 
model priors for M and M 1; the posterior odds of the two models equals to the Bayes 



factor. A calibration of the Bayes factor is given by Kass and Raftery (1995) where 
log-BF > 1 supports Mi and logi?F < supports M . 
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The calculation of P(Y\Mk) in our setting is challenging as it involves high 



dimensional integration. We follow the procedure of Lee and Song (2002) and use the 



parametric arithmetic mean path (PAMP) implementation of the path sampling (see 



also Dutta and Ghosh, 2012) to compute the Bayes factor. Specifically, we assume 
an unnormalized density function f g such that f 9o and fe l are the sampling densities 
for models M and Mi, respectively. The two models are connected in the parametric 
space via a path V = {9 g = g9 + (1 — g)6\ : g e [0, 1]}, where each 9 g e V 
corresponds to a model M g for which the sampling density function is f(0 g ). The 
Bayes factor can then be calculated using the identity 

logBF = log p(y |mo) = /Ve[u(y,n,e I p)]^ I (4.1) 

where -En,e denotes the expectation with respect to the density p(fl, Q\Y, g), and 
U(Y, £l,Q,g) = logp(Y, 0|©, g). The dependence of p(Y,il\Q, g) on g is due to 

P (Y,n\e, g ) K P (n\Y,e)f e (Y\e). 



To compute the integral in equation (4.1), we choose S fixed values {<?i, . . . ,gs} 



such that (7(0) = < g(\) < g^) < ... < g^s) < 1 = 9(s+i) an d then estimate log BF by 
1 5 _ _ 

logBF = - Jj9(s + i) - ^))(U (a+1) + U (s) ), (4.2) 

s=0 

where U( s ) is the average of the values U(Y, Q, O, g>( s )) over all the MCMC samples 
from p(fi, 0|y, ^( s ))- That is, 

_ 1 M 

fc=i 

in which {Q^ k \Q^;k = 1,...,M} are the samples draws from p(Q, Q\Y, g( s )). To 
estimate log-BF, we run the PX— HC (or PH 2 — HC) sampling algorithm for each 
of the grid points, calculate the values of the parameters and, finally, compute U. 
This method is also called Path Sampling with Parameter Expansion (PS— PX) by 
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Ghosh and Dunson (2009). In the context of pleiotropy studies, Bayes factors can 
be applied to different hypotheses. In the remaining part of this section we illustrate 
two instances of substantial interest. 



4.2.1 Selection of Relevant Phenotypes via Bayes Factors. 



We are concerned here with assessing the factor loadings, XjS, as discussed in Section 
4.1. Suppose that we are interested in testing Xj = 0. Let M g be the model with fac- 
tor loadings equal to (A 1; . . . , Xj -i,gXj , \j +i, • • • , Aj)', where g G [0, 1]. Thus, model 
M has (Ai, . . . , A io _i, 0, \ jo+ i, . . . , Xj)', and M l has (Ai, . . . , A jo _i, X jo , X jo+1 , Xj)'. 
The first part of the LVM for M g that links the phenotype YjS and the LV U is 



V&tj — Poj + ^citPj + ^jUdt + bdj + e c itj, 

Vtitjo = Poj + WcItA? + 9^j Udt + b C ij + e c uj , 



if j ^ Jo 



(4.4) 



where V c uj = Y cit j for continuous phenotypes and Vdtj = YjHtj for binary phenotypes. 



The second part of the LVM remains unchanged. We then have 



c N, 



U(F, Q, 6, g (s) ) = -j- Yl Yl [ VcS *Jo ~ ^°io + W*tPio + 9(s)X J0 U cit + ba jo )] X jo U cit , 



JO c =l 1=1 t=l 



where cr? = 1 for the binary phenotypes. The logBF can then be obtained via 



Equations (4.2) and (4.3) with U calculated using the above equation. 



Dutta and Ghosh ( |2012 ) showed that the calculation of Bayes factor is valid when 



lim p ^ i?n,e[U(Y, Q, 0, g)] is finite. The condition holds when the prior distribution 
for A has the first two moments finite, which is true for our relatively diffuse folded-t 
prior with ten df. To explore the sensitivity of the Bayes factor estimation to the 
changes of df in the folded-t prior, we also consider df G {3, 10, 40, 90}. 



Another concern is the tuning of the grid sizes when using equation (4.2) as the 



approximation of logBF. Dutta and Ghosh (2012) suggested a grid size no bigger 



than 0.01, corresponding to 100 grid numbers in [0,1]. However, due to conflict 
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between the prior and sampling distributions, when A is large, Eq i q\U(Y, fi, 0, g)] 
has large spikes when g is close to zero, but it stabilizes gradually as g increases. 
Thus we suggest to use uneven grid size scheme so that smaller grid sizes are chosen 
for g near zero. For example, in our simulations below we set the grid number to be 
15 for g E [0, 0.1] and also 15 for g G [0.9, 1] and observed good results. 



4.2.2 Selection of Pleiotropic Genetic Marker via Bayes Factors 

An important inferential focus in genetic pleiotropy study is the selection of genetic 
marker(s) with pleiotropic effect. Particularly, we are interested in testing the effect of 
a genetic marker on the LV. Suppose that the fixed-effect covariates X in the second 
part of the LVM has two components where X\ is the set of clinical covariates and 
X2 is the genotype of the marker of interest. The two competing LV models are then 

Vlitj = Ay + W£ t {3j + XjUdt + bdj + e cit j, j = 1, Ji, 
A) : { y b ci tj = A), + W? t /3j + XjUdt + bdj + e cltj , j = Ji + 1, J, 
Udt = XT ^ + ZT it a c + Q\\ t d ci . 



Poj + WLfo + XjUdt + bdj + ecttj, j = 1, -, Ji, 



Jcitj - h>0j I" VV dtPj T AjUdt T U ci j T t c itj , 

M i : { Vitj = Ay + W? t 0j + XjUdt + Kij + e C itj: J = Ji + 1, J, 



a ■ 



The two models can be linked up by the parameter g e [0, 1] as: 

Vcitj = Ay + W^Pj + XjUdt + Kij + e ci tj, j = 1, Ji, 
M 9 ■ \ Vcuj = Ay + W^j + XjU cit + bdj + e C itj, J = Ji + 1, J, 
Udt = XT cu ai + gXT cu a 2 + ZT it a c + Q T cit dci- 

Let L g be the complete-data likelihood for model M g , then 

d\ L C N c T 

u(r, n, e, 9{s) ) = = E E E i u <* - X L^ - $w*L«2 - z ^ c - Ql t da) 

y c =i i=\ t=i 



18 



Again log BF can be obtained using (4.2) and (4.3) with U calculated as above. 



5 Simulation study 

We have identified three practically relevant issues related to the model's performance 
that we wanted to study via simulations: 1) the performance of our proposed method 
in parameter estimation, 2) the effect of ignoring the family structure, and 3) the 
performance of the proposed variable selection methods introduced in Section 4. 

Table [T] provides the details of the simulation models in terms of phenotype 
Y and covariates W and X specifications. Without loss of the generality, in all 
simulations we assume that there are 500 families/clusters, each contributing 1, 2, 
3, 4 or 5 siblings with probability, respectively, 0.3, 0.4, 0.2, 0.07 and 0.03. To 
simulate the genotypes for the siblings, we set the minor allele frequency (MAF) 
to be 0.1, and we assume that the parental genotypes for the 500 families follow 
Hardy Weinberg equilibrium(HWE). The siblings' genotypes are then obtained by 
using Mendel's first law of segregation. The parental genotypes however are not used 
for the analysis because they are often not available in practical settings. Continuous 
covariates are assumed to follow a N(0, 1) distribution with correlation coefficient 
within the family (familial correlation) being 0.3 and individual-specific trajectories 
for the covariate (serial correlation) follow an AR(1) model with autocorrelation being 
0.3. We also assume that each phenotype is measured over time five times. Unless 
specified otherwise, the default choice is itj ~ Beta(l, 1) in the spike-and-slab prior 



given in (3.3), assuming no prior information on the relationship between phenotype j 
and the LV. In all the simulations, we run the MCMC algorithm for 25,000 iterations, 
discarding the first 5,000 samples as burn-in. 
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5.1 Parameter estimation for general traits. 



To evaluate the performance of the proposed parameter estimation, we assume here 
that five phenotypes are of interest among which yi, y 2 and y% are continuous, and y\ 
and 1/5 are binary (Tables [l] and [2]). We use a total of 100 replications to study the 
variation of estimates and also compare the performance of our PX 2 — HC algorithm 
with those of the standard algorithms. 

Figure 1 shows that, compared to AC-PX-HC, the standard PX algorithm with 



the DA scheme of Albert and Chib (1993), the proposed PX — HC algorithm dras- 



tically reduces the autocorrelations between the chains's realizations which implies 
an increase in Monte Carlo efficiency The improvement is also evident in Figure 
2, when comparing the standard Gibbs (left column) with the proposed PX 2 — HC 
scheme (right plots). In the Appendix, we provide the corresponding trace plots to 
illustrate the improved mixing of the PX 2 — HC chain. 

Table [2] shows the true values, mean estimates and root mean square errors 
(RMSE) for all the parameters in the model. Results show that the estimates have 
good accuracy with small bias and RMSE. However, parameter estimation for binary 
responses is less accurate than for continuous phenotypes, which is not surprising 
given the discrepancy between the information provided by the two types of data. 

Figure [3] shows that around 95% of the Hpdls for the factor loadings As cover 
the true values. However, this remains to be validated theoretically for more general 
setting as results seem to suggest a matching prior type of result. 

5.2 Effect of Ignoring Family Structure 

In this simulation, we only consider the three continuous phenotypes yi, y 2 and y% 
and generate data with familial and serial correlations as above in Section 5.1. We 
compare parameter estimates obtained using the proposed method to model the fa- 



milial correlation with the estimates obtained using the method of Roy and Lin (2000) 
assuming samples are independent of each other. 



20 



The simulation results in Table [3] show that failure to account for the familial 
correlation present in the data will not only yield incorrect conclusion of the serial 
correlations between phenotypes but also cause over-estimation of the factor 
loading Xj, which quantifies the strength of the relationship between phenotype yj 
and the latent variable U, and under-estimation of a.\ and which respectively, 
represents the effect of the clinical covariate and genetic marker on U. 



5.3 Model and Variable Selection 

5.3.1 Selection of Relevant Phenotypes via Spike-and-Slab Priors 



To assess the performance of phenotype selection using the spike-and-slab prior (3.3) 
as described in Section 4.1, we consider here four continuous phenotypes y c with 
factor loading XjS being set to (0.5, 0.05, 0.02, 0) and three binary phenotypes y b with 
XjS being (0.2, 0.01, 0). We chose these values so that the strength of the association 
between a phenotype and the LV U ranges from strong (Xj = 0.5) to no association 
(Xj = 0). We set tt ~ Beta(0.25, 1) which favours a priori the null (Xj = 0). After 
the calculation of the posterior frequency of Uj = 1 for each phenotype in the 100 
replicates, a threshold of (ft = 50% is used on the posterior probability Pr(Xj > 0\Y). 

Simulation results show that the type I error of this selection procedure is about 
5% in that, for the phenotype not associated with the latent disease status (Xj = 0), 
around 5% of the replications have Pr(Xj > 0\Y) > 0.5. For phenotypes moderately 
or strongly associated the latent variable (Xj = 0.05 or higher), we have 100% power 
of making the correct decision. However, power is reduced for weakly associated 
phenotypes as expected. Specifically for the case consider here, power is 71% for the 
phenotype with Xj = 0.02 and 16% for Xj = 0.01. 



5.3.2 Selection of Relevant Phenotypes via Bayes Factors 

We have also examined the performance of phenotype selection via Bayes factors for 
the comparison of model Mi (assuming A > 0) and model M (assuming A = 0) using 
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the same simulation model as in the previous section. The logBFs for the A's are 
calculated using the folded-t prior with df = 10 as described in Section 4.2.1. 

Results in Table|4]show that when phenotype yj is truly associated with the latent 
variable with Xj > 0.05, the average estimated log(BF) > 5.9. This and combined 
with the SD shown in the table suggest that, for such cases, the Bayes factor criterion 
chooses the correct alternative model 100% of the times. When phenotype yj is truly 
not associated with the LV (Xj = 0), the Bayes factor criterion correctly favors the 
null model with the average estimated log(BF) < —2.4. However, the Bayes factor 
criterion has little ability in identifying weakly associated phenotypes (Xj < 0.02), 
which is inferior to the spike-and-slab approach in the previous section. We have also 
explored the use of other folded-t priors with df ranging from 3, 10, 40 to 90. We find 
that all the prior settings give us the same conclusion on the model selection. 



5.3.3 Selection of Pleiotropic Genetic Marker or Other Covariates 

For this purpose, we assume that there are five indirect fixed-effect covariates X 1; . . . , X 5 
in the second part of the LVM that models the effect of the genetic marker and other 
covariates on the latent variable. We assume that X 3 is the genotype of the marker 
of interest and the remaining ones are standardized continuous variables. Coeffi- 
cients (ai, «2, «3, «4, a$) quantify the effects of the five covariates which are set to 
be (1.0,-0.5,0.2,0,0). We also assume that there are two continuous standardized 
direct fixed-effect covariates W\ and W2 in the first part of the LVM with effects 
= 0.5 and (3 2 = 0.3) on all the phenotypes. 

Table [5] shows the estimated log(BF) for comparing the null model assuming 



a.j = with the alternative of association. The prior distribution for a in (2.4) is 
a ~ iV(0, 15). Results show that the proposed Bayes factor criterion has the ability to 
detect the association between the genetic marker and the latent variable, therefore 
pleiotropy, with the average estimated log(BF) = 6.3 and SD = 2.39. The Bayes 
factor variable selection criterion also has good result for the two associated covariates 
Xi and X2 for which the average log(BF) estimate is, respectively, 337.5 and 6.3, as 
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well as for the two covariates X4 and X5 with zero effect, for which the estimated 
log(BF) is consistently less than —1.9 with SD = 0.42. 



6 Application to a genetic study of type 1 diabetes 
(T1D) complications. 

Here we demonstrate the practical utility of the proposed LVM method by investigat- 
ing the blood pressure data from a genome- wide association study (GAWS) of various 



T1D complications (Paterson et al. 2010). The study sample consists of n = 1300 



individuals with T1D from the Diabetes Control and Complications Trial (DCCT). 
Various phenotypes thought be to related to T1D complication severity, including 
glycosylated hemoglobin (HbAlc) and diastolic (DBP) and systolic blood pressure 
(SBP), were collected from each subject over the course of the DCCT. Additional co- 
variates such as sex and body mass index (BMI) were also collected, and individuals 
were from two different cohorts and subjected to two treatment types (conventional 
vs. intensive). Over 800K SNPs were genotyped by the Illumina 1M bead chip assay 
for these individuals. 

Because T1D is a complex disease with various complication measures (the ob- 
served phenotypes), it is of great interest to quantify the conceptual latent complica- 
tion status, as well as to understand the influencing factors (both genetic markers and 
clinical covariates). In addition, it is valuable to determine if the various observed 
phenotypes are truly associated with the latent variable. However, due to lack of 
suitable statistical methodology, previous analyses have been limited to the standard 



uni-phenotype approach, analyzing one phenotype at a time. For example, Ye et al. 



(2010) recently performed GAWS, separately, for DBP and SBP, and they identified 
rs7842868 on chromosome 8 as a SNP significantly associated with DBP. 

Our goal here is to formally perform a multi-phenotype analysis, jointly analyzing 
DBP and SBP using the proposed Bayesian LVM methodology. This approach allows 
us not only to determine if rs7842868 is associated with the latent conceptual T1D 
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complication variable, but also to test if DBP and SBP are truly related to the LV. 
It is also of practical interest to study whether there are other phenotypes such as 
hyperglycemia (HPG) influence the latent complication severity variable. 

In our application, we investigate three phenotypes Y among which two are 
continuous (DBP and SBP) and one is binary (HPG = 1 for hyperglycemia if and 
= for normal glycemia), all are longitudinal. Hyperglycemia at a given time point 
is defined if the corresponding HbAlc is greater or equal to 8. Among the available 
covariates, based on suggestions from clinicians, covariates W that have direct effects 
on the phenotypes include BMI, and covariates X that have direct effects on the 
LV include sex, cohort and treatment. Among the 10 longitudinal measurements 
available, there are significant amount of missing data after the 7 th measure (due to 
staggered entry) while there are little missing data before the 5 th measure. Therefore, 
we only use the first five measurements. We treat the remaining missing data as 
Missing at Random (MAR), and we replace the missing data with the means of all the 
other measurements. In this dataset, there is only one person in each family therefore 
there is no familial correlation, but the proposed methodology remains suitable by 
assuming the cluster size being 1. 



We first consider rs7842868, a SNP found by Ye et al. (2010) to be associated 
with DBP. In this case X\ is the genotype of rs7842868. Results in Table [6] show 
that DBP and SBP are clearly associated with the latent variable with estimated 
logBF over 100, while HPG is not. We also applied the spike-and-slab prior method 
for phenotype selection as discussed in Sections 4.1 and 5.3.1, the poster probability 
Pr(\j > 0\Y) is 1 for DBP and SBP and 0.235 for HPG. Thus all model selection 
criteria consistently suggest that both DBP and SBP are significantly related to the 
latent variable but not HPG. Results also show that SNP rs7842868 is significantly 
associated with the latent variable with estimated logBF over 10 and the 95% Hpdl 
not covering 0. The sign of the effect suggest that the minor allele of the SNP is 
protective in that it decreases the latent complication severity score. The combined 
evidence from both parts of the LVM show that rs7842868 has pleiotropic effect on 
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the two blood pressures. Sex and cohort are also found to be significantly associated 
with the latent variable but not treatment. 



We then investigate rsl358030, a SNP found by Paterson et al. (2010) to be 
associated with HbAlc. In this case, X\ is the genotype of rsl358030. Based on 
results in Table [6j there is no evidence that rsl358030 is significantly associated with 
the latent variable. Our application here considers the binary hyperglycaemia (HPG) 
as the third phenotype of interest instead of the continuous HbAlc variable. Besides 
clinical consideration, this choice also allows us to evaluate the proposed method for 
general traits as described in Section 3.2.2 

To further evaluate the proposed method, we simulate genotypes for two NULL 
SNPs that are not associated with the phenotypes of interest. One SNP has MAF 
equal to 0.25, the MAF of rs7842868), and the other one has MAF equal to 0.35, the 
MAF of rsl358030. As expected, no significant associations are detected. 



7 Conclusion and Disucssion 

We propose here a Bayesian latent variable approach to joint model multiple out- 
comes, motivated by genetic association studies of pleiotropic effects. The method 
can handle continuous and binary responses while accounting for serial and famil- 
ial correlation structures in the data. The postulated latent variable represents the 
underlying severity or complication level of a trait and characterizes the totality of 
multiple observed phenotypes of interest. If additional phenotypes were to be mea- 
sured, the latent variable could change its significance since it would encapsulate a 
richer set of manifest variables. The central feature of the model is that it allows 
us to consider the strength of dependence between the genotype and each of the 
phenotypes in a unified manner. The Bayesian approach takes into account all the 
uncertainty present in the model and incorporates prior information if available. The 
computational challenges are met via the use of parametric expansion techniques. 

An important issue for pleiotropy studies is the assessment of importance of each 
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variable. We have adopted two Bayesian techniques that are shown to be effective in 
variable selection. We found that both Bayes factor and spike-and-slab prior perform 
well with the latter slightly more efficient in terms of detecting weak signals. 

Our proof-of-principle application to a genetic study of type 1 diabetes compli- 
cations demonstrates the utility of the method in a real data setting. So far, genetic 
association studies of various T1D complication-related measures have been limited 
to studying one phenotype at a time. The proposed method jointly analyzes two 
continuous and one binary phenotypes of interest, and it provides evidence for the 
association between the phenotypes and the latent severity of T1D complication, the 
association between the latent variable and genetic markers of interest, and the effect 
of other covariates on the phenotypes and the latent variable. 

The computational load in the current implementation of the proposed method 
makes it impractical to perform a genome-wide search for pleiotropic genetic variants. 
The recent advances in parallel computing can partially alleviate this constraint. 
Alternatively, a two-stage approach can be used in which a simple and less stringent 
selection procedure is first used to select a moderate number of candidate variants 
for further investigation using the proposed method. The uncertainty inherited from 
the first-stage selection, however, must be accounted for in the models used in the 
second stage. 
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Table 2: Performance of the proposed parameter estimation method. First three phe- 
notypes y 1 , y 2 and y 3 are continuous and the last two phenotypes y 4 and y 5 are binary. 
Results are based on 100 replicates that were simulated as described in Section 5.1 and 
Table 



First part of the latent variable model 



Parameter 


True value 


Estimate 


RMSE 


Xj, the factor loading 


for phenotype yj and the latent variable U 


Ai 


1.0 


1.000 


0.007 


A 2 


1.0 


1.000 


0.007 


A 3 


1.0 


1.000 


0.007 


A 4 


1.0 


1.000 


0.022 


A 5 


1.0 


0.998 


0.025 


/3j for phenotype yj and covariate \ 


V 


Pi 


1.0 


1.000 


0.011 


P2 


1.0 


1.000 


0.011 


Ps 


1.0 


1.000 


0.011 


Pa 


1.0 


1.005 


0.026 


P 5 


1.0 


0.996 


0.031 


Second part of the latent variable model 


Parameter 


True value 


Estimate 


RMSE 


a for the latent variable U and covariates («2 for the genetic marker X%) 




1.0 


0.999 


0.011 




1.0 


1.011 


0.074 


Correlation parameters 


Parameter 


True value 


Estimate 


RMSE 




0.2 


0.200 


0.014 




0.2 


0.201 


0.012 


ri 


0.2 


0.200 


0.013 


rl 


0.2 


0.202 


0.033 


ri 


0.2 


0.200 


0.032 


Sa 


0.5 


0.519 


0.046 




0.3 


0.320 


0.032 
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Table 3: Effect of ignoring familial correlation present in the data. Results are the 
averages over 100 simulation replications. The coefficient a 2 evaluates the effect of 
a genetic marker on the latent variable U {a\ for a clinical covariate on U), and Xjs 
evaluate the effect of U on the three pheno types of interest. Details of the simulations 
are in Section 5.2 and Table Q] 









With Familial Corr. 


Without Familial Corr. 


Parameters 


True Value 


\\\ Pi c; 

1—) 1 < X o 


SD 


RMSE 

X VjlV-LkJ J—J 


Rl PI t; 

J-JldQ 


SD 


RMSF 




HI 


1.0 


-0.002 


0.018 


0.019 


-0.003 


0.028 


r\ ao'7 

0.027 






1.0 


-0.001 


0.020 


0.019 


-0.002 


0.024 


0.028 




fa 


1.0 


-0.002 


0.019 


0.019 


-0.003 


0.027 


0.028 


a 


(X\ 


1.0 


0.009 


0.074 


0.075 


-0.233 


0.099 


0.235 


a 2 


1.0 


0.009 


0.020 


0.021 


-0.227 


0.028 


0.244 




Ai 


1.0 


-0.009 


0.014 


0.015 


0.308 


0.029 


0.310 


A 


A 2 


1.0 


-0.009 


0.014 


0.016 


0.308 


0.028 


0.309 




A 3 


1.0 


-0.009 


0.015 


0.015 


0.308 


0.029 


0.309 




rl 


0.2 


-0.002 


0.013 


0.015 


-0.001 


0.015 


0.015 


T 2 


rl 


0.2 


-0.001 


0.012 


0.014 


-0.001 


0.013 


0.015 






0.2 


-0.002 


0.012 


0.013 


0.000 


0.014 


0.014 






0.1 


0.002 


0.004 


0.004 


0.002 


0.004 


0.004 


a 2 




0.1 


0.001 


0.004 


0.004 


0.001 


0.004 


0.004 




°l 


0.1 


0.002 


0.004 


0.004 


0.002 


0.004 


0.004 




(Sa)ii 


1.0 


-0.003 


0.08 


0.089 


N/A 


N/A 


N/A 




(Ea)i2 


0.0 


0.004 


0.06 


0.064 


N/A 


N/A 


N/A 




(S^)22 


1.0 


0.034 


0.08 


0.085 


N/A 


N/A 


N/A 




(Sij)ii 


0.1 


0.049 


0.020 


0.053 


0.69 


0.06 


0.696 




(^d)i2 


0.0 


0.000 


0.011 


0.011 


-0.002 


0.022 


0.023 




(^d)22 


0.1 


0.029 


0.023 


0.033 


0.047 


0.021 


0.052 
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Table 4: The estimated logBF for testing the factor loading Xj. Xj quantifies the 
association between phenotype t/j and the latent variable. Results are based on 50 
simulated replicates. Details of the simulations are in Section 5.3.2 and Table [TJ 





Continuous Phenotypes 


Binary Phenotypes 




V\ 


V2 


V3 


y* 


V5 


2/6 


V7 


True Xj 


0.5 


0.05 


0.02 





0.2 


0.01 





logBF 


121.60 


5.91 


-1.76 


-2.67 


22.11 


-2.43 


-2.43 


SD 


6.47 


2.22 


1.05 


0.47 


8.19 


0.32 


0.13 



Table 5: The estimated logBF for testing a = 0. a quantifies the association between 
the covariates and the latent variable. Results are based on 50 simulated replicates. 
Details of the simulations are in Section 5.3.3 and Table [U 

Covariates with effect on the latent variable 





X x 


x 2 


X 3 (genotype) 


x 4 


x 5 


True a 


1.0 


-0.5 


0.2 








logBF 


337.53 


94.99 


6.30 


-1.94 


-1.95 


SD 


33.25 


11.19 


2.39 


0.46 


0.42 
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Table 6: Application results. SNP rs7842868 was previously identified to be associated 
with diastolic blood pressure (DBP) and SNP rsl358030 was previously identified to 
be associated with HbAlc. Phenotypes of interest are DBP and systolic blood pres- 
sure (SBP), two continuous outcomes, and hyperglycemia (HPG, defined as HbAlc 
greater or equal to 8), a binary outcome. All phenotypes are thought to be related 
to type 1 diabetes complication severity. The coefficient As assess the association 
between the phenotypes and the latent T1D complication status, and as evaluate 
the association between the latent variable and the genetic marker and the other 
covariates of interest. See Section 6 and Table 1 for more details. 



Analysis of SNP rs7842868 




Parameter 


Estimate 


95% Hpdl 


logBF 


SBP 


Ai 


6.621 


(6.153, 7.077) 


114.85 


DBP 


A 2 


3.842 


(3.566, 4.110) 


112.98 


HPG 


A 3 


0.011 


(2.189 x 10~ 7 , 2.975 x 10~ 2 ) 


-1.05 


rs7842868 


ai 


-0.269 


(-0.372, -0.164) 


10.06 


sex 


a 2 


-0.721 


(-0.866, -0.584) 


62.27 


cohort 


a 3 


0.443 


( 0.299, 0.585) 


20.15 


treatment 


a4 


0.128 


(-0.004, 0.263) 


0.366 


Analysis of SNP rs!358030 




Parameter Estimate 


95% Hpdl 


logBF 


SBP 


Ai 


6.868 


(6.439, 7.302) 


128.3 


DBP 


A 2 


3.706 


(3.491, 3.933) 


120.2 


HPG 


A 3 


0.010 


(2.566 x 10- 7 , 2.740 x lO" 7 ) 


-1.034 


rsl358030 


CCi 


- 0.039 


(-0.049, 0.122) 


-1.104 


sex 


a 2 


-0.758 


(-0.880, -0.623) 


64.86 


cohort 


a 3 


0.393 


(0.258, 0.532) 


18.17 


treatment 


a4 


0.088 


(-0.041, 0.220) 


-0.18 
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Figure 1: Autocorrelation functions (ACF) for the AC-PX-HC (dashed line) scheme 
and the proposed PX 2 -HC sampling scheme (solid line), averaged over 100 simulation 
replicates. The autocorrelation is based on the posteriors of the factor loadings A4 (left 
panel) and A5 (right panel) for the two binary phenotypes as simulated in Section 5.1 
and described in Table 1. 



Comparison of ACF plots for A. 4 Comparison of ACF plots for l s 



Li. 
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Figure 2: Autocorrelation functions (ACF) for the standard Gibbs (SG) sampling 
scheme (left panel) and the proposed PX 2 -HC sampling scheme (right panel) , averaged 
over 100 simulation replicates. The autocorrelation is based on the posterior draws 
for the three factor loadings XjS (j = 1,2,3) for the three continuous phenotypes, as 
discussed in Section 5.1 and described in Table 1. 



acf plot for k\ using SG scheme. 



10 20 



acf plot for k 2 using SG scheme. 



10 20 



acf plot for k 3 using SG scheme. 



10 20 



acf plot for Xi using PX-HC scheme. 



acf plot for A.2 using PX-HC scheme. 



acf plot for A, 3 using PX-HC scheme. 
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